There are negative values in the 'Experience' column that we will have to deal with.

There are no missing values, which is very good news.

There is no duplicate data.

The negative values in the 'Experience' column have been taken care of.
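For illustration, a minimal sketch of one common way this can be done, assuming the data sits in a pandas DataFrame (the `df` name and sample values here are hypothetical); negatives are clipped up to zero, though replacing them with absolute values is another option:

```python
import pandas as pd

# Hypothetical sample with negative experience values
df = pd.DataFrame({"Experience": [5, -2, 10, -1, 3]})

# Clip negative values up to zero (df["Experience"].abs() is an alternative)
df["Experience"] = df["Experience"].clip(lower=0)

print(df["Experience"].tolist())  # [5, 0, 10, 0, 3]
```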

Univariate Analysis

Income is right skewed with some outliers.

Very few customers hold a loan with the bank, which is what we are trying to change.

We see above that most customers are undergraduates.

We will better organize the zip code data below.
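One simple way to organize the zip codes, sketched here with hypothetical values (the `ZIPCode` column name is an assumption), is to keep only the first two digits, which roughly identify a geographic region:

```python
import pandas as pd

# Hypothetical sample of the 'ZIPCode' column
df = pd.DataFrame({"ZIPCode": [94720, 90089, 94305, 92037]})

# Keep the first two digits, which roughly correspond to a region;
# this collapses many sparse zip codes into a few broader categories
df["ZIPCode"] = df["ZIPCode"].astype(str).str[:2]

print(df["ZIPCode"].tolist())  # ['94', '90', '94', '92']
```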

Bivariate Analysis

The heatmap above tells us that 'Age' and 'Experience' are highly correlated, 'Income' and 'CCAvg' (credit card average) are correlated, and 'Income' and 'Mortgage' have some correlation.

The three sets of graphs and charts below show the distribution of outcomes (yes = 1, no = 0) as personal loan ownership relates to specific variables in the data set.

The graph below tells us that customers with Professional-level education are more likely to have a Personal Loan with the bank.

Though the difference is not large, we see here that customers with a Securities Account are somewhat more likely to have a personal loan with the bank than customers without one.

Customers with a CD Account are much more likely to have a Personal Loan than those without.

There is no significant difference in personal loan ownership between online and non-online customers, as shown below.

There is no significant difference in personal loan ownership between credit card holders and non-holders, as shown below.

We can see that CCAvg, Mortgage, and Income have upper outliers. We will treat them next.
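As a sketch of one common treatment, values can be capped at Tukey's 1.5 × IQR fences (the sample values below are hypothetical; the notebook's actual rule may differ):

```python
import pandas as pd

# Hypothetical data with an upper outlier
df = pd.DataFrame({"CCAvg": [1.0, 1.5, 2.0, 2.5, 3.0, 40.0]})

# Compute Tukey's fences: 1.5 * IQR beyond the quartiles
q1, q3 = df["CCAvg"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorize) values outside the fences instead of dropping rows
df["CCAvg"] = df["CCAvg"].clip(lower=lower, upper=upper)

print(df["CCAvg"].max())  # the 40.0 outlier is pulled down to the upper fence
```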

Logistic Regression

Observations

*Positive values of a coefficient show that the probability of a customer getting a Personal Loan increases as the corresponding attribute value increases.

*Negative values of a coefficient show that the probability of a customer getting a Personal Loan decreases as the corresponding attribute value increases.

*The p-value of a variable indicates whether the variable is significant. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.

*But these variables might contain multicollinearity, which will affect the p-values.

*We will have to remove multicollinearity from the data to get reliable coefficients and p-values.

*There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).

We see above that Age and Experience are highly correlated and carry essentially the same information. We will start by dropping 'Age' to see what happens to the scores.

Removing age has gotten rid of the multicollinearity above.

Above we see no significant change in performance.

In the model above, the p-values for all the zip codes are above 0.05. We will drop these, as this means they are not significant.

After dropping the zip codes we see no significant change in the model, but two remaining variables, Mortgage and Experience, have p-values greater than 0.05. We will drop Mortgage first to see how it changes the model, if at all.

Building Decision Tree Model

We only have about 9% positive classes, so even a model that marks every sample as negative would achieve roughly 91% accuracy, which means accuracy is not a good metric here.
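To illustrate with toy numbers matching the roughly 9% positive rate (hypothetical labels, not the actual data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1000 hypothetical labels, 9% positive (customer took the loan)
y_true = np.array([1] * 90 + [0] * 910)

# A useless model that predicts "no loan" for everyone
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.91 -- looks good, but...
print(recall_score(y_true, y_pred))    # 0.0  -- catches no actual buyers
```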

Insights:

True Positives:

*Reality: A customer got a Personal Loan
*Model predicted: The customer would get a Personal Loan
*Outcome: The model is good.

True Negatives:

*Reality: A customer did not get a Personal Loan
*Model predicted: The customer would not get a Personal Loan
*Outcome: The business is unaffected.

False Positives:

*Reality: A customer did not get a Personal Loan
*Model predicted: The customer would get a Personal Loan
*Outcome: The team targeting potential customers will waste resources on people who will not contribute to the revenue.

False Negatives:

*Reality: A customer got a Personal Loan.
*Model predicted: The customer would not get a Personal Loan
*Outcome: The customer contributed to revenue but was not targeted by advertising. This results in a loss of revenue; if these customers had been targeted, the bank could have sold more Personal Loans.

Based on the confusion matrix and scores above, we need to check recall to evaluate the performance of the model. As it stands, we will not be able to predict whether or not a customer will secure a Personal Loan with the bank.
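As a reminder of how recall is read off the confusion matrix, a toy example (the labels below are made up, not the model's actual predictions):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = took a Personal Loan, 0 = did not
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# sklearn's confusion matrix flattens as (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tp / (tp + fn))                # recall by hand: TP / (TP + FN) -> 0.5
print(recall_score(y_true, y_pred))  # same value from sklearn
```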

There is not a huge difference in the recall scores of training and test, but we will still try to make improvements.

• According to the Decision Tree Model, Income is most important when predicting if a customer secures a Personal Loan.

The tree above is somewhat complex; we will attempt to simplify it.

Reducing Overfitting

We see above that the overfitting has been reduced.

We will begin Cost Complexity Pruning.
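A minimal sketch of how cost complexity pruning is exposed in scikit-learn, using a synthetic imbalanced dataset as a stand-in for the bank data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (~10% positive class)
X_train, y_train = make_classification(
    n_samples=500, n_features=8, weights=[0.9], random_state=42
)

# Compute the effective alphas and corresponding total leaf impurities
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Impurity rises as alpha increases; refitting with a chosen alpha
# yields a smaller, pruned tree
pruned = DecisionTreeClassifier(
    ccp_alpha=ccp_alphas[len(ccp_alphas) // 2], random_state=42
).fit(X_train, y_train)

print(len(ccp_alphas), pruned.get_n_leaves())
```

In practice one tree is fit per alpha and the alpha with the best validation recall (short of collapsing to a root node) is kept.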

• We see that as the effective alphas increase, so does the total impurity of the leaves.

• The maximum value of Recall is at alpha = 0.04, but choosing that alpha leaves a decision tree with only a root node. Instead we can choose alpha = 0.005, which retains more of the tree's information while still giving a high recall.

Visualizing the Decision Tree

The model above has the highest recall, but the bank needs more information to target customers for Personal Loans.

We see above that the recall results on the training and test data have improved with Cost Complexity Pruning.

The decision tree with post-pruning is the best. Though it doesn't have the highest score of all, it provides the bank with valuable information to gain new customers.